Preserve SpeechLM perception checkpoint dtype #15686
Merged
pzelasko merged 12 commits on May 13, 2026
Conversation
Avoid forcing the SpeechLM audio perception module to FP32 during vLLM inference so BF16 checkpoints can run the encoder in their stored dtype while keeping raw audio preprocessing in FP32. Signed-off-by: Dongji Gao <dongjig@nvidia.com>
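The dtype flow described above can be illustrated with a minimal sketch. The class and names here are hypothetical stand-ins, not the actual NeMo `AudioPerceptionModule` code: the point is only that preprocessing runs in FP32 while the processed features are cast down to the encoder's stored dtype.

```python
import torch
import torch.nn as nn

class PerceptionSketch(nn.Module):
    """Hypothetical sketch of the dtype flow: raw audio preprocessing
    stays in FP32, while the encoder runs in whatever dtype its
    checkpoint was stored in (e.g. BF16)."""

    def __init__(self, preprocessor: nn.Module, encoder: nn.Module):
        super().__init__()
        self.preprocessor = preprocessor  # kept in FP32
        self.encoder = encoder            # left in its checkpoint dtype

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # Feature extraction is numerically sensitive, so run it in FP32.
        feats = self.preprocessor(audio.float())
        # Cast only the processed features to the encoder's stored dtype,
        # instead of upcasting the encoder weights to FP32.
        enc_dtype = next(self.encoder.parameters()).dtype
        return self.encoder(feats.to(enc_dtype))
```

With a BF16 encoder, the output embeddings come back in BF16 without the encoder weights ever being materialized in FP32.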
Move the processed-feature dtype cast out of the shared perception module and into the SpeechLM vLLM model path so this fix remains scoped to plugin inference. Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Keep the dtype conversion scoped to the SpeechLM vLLM plugin path and leave the shared perception module unchanged. Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Force-pushed from e369a20 to 36c35e2
pzelasko reviewed on May 11, 2026
Collaborator
pzelasko left a comment
Great fix, but I think there is too much defensive dtype casting. Can we minimize to only the absolutely necessary ones?
Call the audio preprocessor directly before casting features to the perception encoder dtype, keeping the dtype fix scoped to the plugin inference path. Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Keep the perception module in the checkpoint dtype while loading the original tensors directly. Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Preserve the existing BF16 LLM boundary cast and keep the PR focused on avoiding FP32 perception weights. Signed-off-by: Dongji Gao <dongjig@nvidia.com>
pzelasko reviewed on May 11, 2026
Keep raw audio preprocessing in FP32 and run perception in BF16 for the vLLM plugin path without extra defensive dtype detection. Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Perception outputs already follow the plugin perception dtype, so avoid an extra cast before returning audio embeddings. Signed-off-by: Dongji Gao <dongjig@nvidia.com>
Rely on AudioPerceptionModule to handle preprocessing and encoder handoff after the plugin sets the perception module dtype. Signed-off-by: Dongji Gao <dongjig@nvidia.com>
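As a hedged illustration of the commits above, keeping a module in its checkpoint dtype can be sketched as loading the original tensors after aligning the module's parameters to the stored dtype. The helper name and flow are assumptions for illustration, not the actual plugin code:

```python
import torch
import torch.nn as nn

def load_in_checkpoint_dtype(module: nn.Module, state_dict: dict) -> nn.Module:
    # Hypothetical helper: infer the stored dtype from the checkpoint
    # tensors and move the module to it before loading, so the weights
    # are never upcast to FP32 along the way.
    ckpt_dtype = next(iter(state_dict.values())).dtype
    module.to(ckpt_dtype)
    module.load_state_dict(state_dict)
    return module
```

After this, downstream code only needs a single cast of processed features to the module's dtype at the plugin boundary, matching the reviewer's request to drop extra defensive casts.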
pzelasko approved these changes on May 11, 2026
Collaborator
/ok to test 82fa4af
Collaborator
/ok to test 8653bac
Contributor (Author)
/ok to test 83974ea
Contributor
[🤖]: Hi @DongjiGao 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals.
Summary
Test plan
- Syntax check of the touched files:
  python3 -c "import ast; ast.parse(open('/home/dongjig/NeMo_merge/nemo/collections/speechlm2/vllm/salm/model.py').read()); ast.parse(open('/home/dongjig/NeMo_merge/nemo/collections/speechlm2/modules/perception.py').read())"
- Inference results: /data/dongjig/results/quantization/speechlm_bf16_perception_checkpoint_dtype_voxpopuli_20260511_095950/result.json
- Leaderboard comparison: ./data/dongjig/results/quantization/speechlm_leaderboard_perception_dtype_20260510_150218; no meaningful WER regression observed for BF16 perception.